Introduction

The effects of the COVID-19 pandemic on mental health have become an increasingly recognized issue in public discourse, yet mental health science perspectives remain relatively overlooked in the construction of research priorities and policy agendas related to recovery from the COVID-19 pandemic. Researchers recognize that the psychological and social effects of the COVID-19 pandemic on mental health are pervasive (Holmes & O’Connor et al., 2020) and that, without government support, individuals are at risk of bearing a “psychological scar” (Clemente-Suárez et al., 2020, p. 4) long after the pandemic ends. Therefore, coordination between governments and global research platforms to prioritize understanding mental health outcomes during the COVID-19 pandemic is necessary to mitigate mental health consequences for vulnerable groups under current and future pandemic conditions (Holmes & O’Connor et al., 2020).

The objective of this report is to inform readers about the impacts of COVID-19 on mental health in the United States. Although mental health is a global issue, the decentralized and inconsistent response to COVID-19 in the United States motivates us to explore populations in the United States as a case study for populations in countries implementing similar dynamic responses to the pandemic. The cyclic nature of lockdowns and control measures in the United States means Americans will continue to experience stressors associated with the COVID-19 pandemic. So, we hope that our findings will help advance knowledge about Americans’ vulnerability to negative mental health outcomes so that susceptible groups can be prioritized in government response and assistance programs related to COVID-19 or future pandemics.

To investigate the impact of COVID-19 on mental health in the United States, we explore data from the WHO Coronavirus Dashboard and the United States Census Bureau Household Pulse Survey (HPS). We also provide descriptive analysis of Google search queries related to anxiety and depression to supplement the HPS data and guide our hypotheses. We define our primary question of interest as whether an association exists between the number of confirmed COVID-19 cases and the frequency of indicators, or symptoms, of depressive or anxiety disorder. Furthermore, we define our secondary question of interest as, if an association exists between the number of confirmed COVID-19 cases and the frequency of indicators of depressive and anxiety disorder, whether a specific population in the United States experiences the greatest frequency of symptoms. To produce combined data sets for our analysis, we select the Date_reported, Country, New_cases, Cumulative_cases, New_deaths, and Cumulative_deaths variables from the WHO Coronavirus Dashboard and the Indicator, Group, Time Period, Time Period Label, Time Period Start Date, Time Period End Date, Value, Low CI, High CI, and Quartile Range variables from the HPS.

Prior analyses of HPS data suggest a disproportionate negative impact of COVID-19 stressors on emotional distress among older persons of color. (Bui et al., 2020) and a significant mental health burden stemming from job insecurity among young adults amidst the COVID-19 pandemic (Ganson et al., 2021). Consequently, we hypothesize that the frequency of indicators of depressive or anxiety disorder increase as the new case count associated with COVID-19 grows in the United States. We also expect that, when comparing populations defined by ethnicity, non-White groups will have the greatest frequency of symptoms. When comparing populations defined by age, we expect to find the highest frequency of symptoms related to depressive or anxiety disorder among young adults, but older adults may also be vulnerable to a greater risk of symptoms.

Background

Recent findings on the impacts of COVID-19 on mental well-being motivate our analysis. On a global scale, researchers have analyzed data from Google Trends to test whether COVID-19 and the associated lockdowns implemented in Europe and America led to changes in well-being related topic search terms; the detection of a significant increase in searches for loneliness, worry, and sadness suggests that people’s mental health may have been severely affected by the pandemic and lockdown (Brodeur et al., 2020). Since data from before the pandemic and lockdowns began is necessary to fully assess how the COVID-19 affects population well-being, Brodeur et al. only analyzed data from Google Trends between January 1st 2019 and April 10th 2020 in countries that had introduced a full lockdown by the end of this period (Brodeur et al., 2020). Therefore, we can interpret Brodeur et al.’s finding of a significant increase in search terms related to poor mental well-being as being associated with the pandemic and lockdowns. Brodeur et al.’s findings suggest that we may expect a positive relationship between the number of COVID-19 cases and the frequency indicators of anxiety and depression in the United States when addressing our primary question of interest.

On a national scale, researchers have identified disparities in the degree of negative mental health consequences associated with COVID-19 across age groups and ethnicities. For example, research in the United States, United Kingdom, and Italy consistently shows that younger adults are more vulnerable to increased psychological distress than older adults (Abbott, 2020; Delmastro & Zamariola, 2020), and that within the United States, job insecurity associated with the pandemic might exacerbate symptoms of poor mental health among young people (Ganson et al., 2021). In addition, greater frequencies of negative health outcomes among minority ethnicities may stem from being disproportionately affected by COVID-19 containment policies (Abbott et al., 2021) or from expectations regarding future inability to recover lost economic resources in the future (Bui et al, 2020). Since these findings come from analysis on data from either large, representative samples or the HPS, we expect to find similar disparities across age or ethnicities when exploring our secondary question of interest.

We compiled and filtered data from the WHO COVID-19 Dashboard and the HPS to produce the data set for our model. The WHO Coronavirus Dashboard reflects laboratory-confirmed cases and deaths based upon WHO case definitions (World Health Organization [WHO], 2020). All WHO COVID-19 data represents the date of reporting as opposed to date of symptom onset and counts of new cases and deaths were calculated by subtracting previous cumulative total counts from the current count (WHO, 2020). The HPS was developed to help understand the social and economic impacts of COVID-19 on American households and designed to meet the goal of accurate and timely weekly estimates (Fields et al., 2020). HPS investigators expected low response rates, so they applied a weighting procedure within each state consisting of four adjustments that were applied to the base weights (Fields et al., 2020). Although the HPS produced responses related to many social and economic measures across three different geographic levels (estimates for the 15 largest Metropolitan Statistical Areas, state-level estimates for each of the 50 states and the District of Columbia, and national-level estimates), our predictive analysis is based on national-level estimates of the percentage of respondents who report experiencing symptoms associated with diagnoses of generalized anxiety disorder or major depressive disorder.

Descriptive Analysis

With so many datasets to compare it is important to process them carefully and consistently. The data collected from the WHO on Global COVID-19 deaths and cases is first reduced in scope by filtering out all countries except the United States. The data from the Household Pulse Survey (HPS) is looked at in multiple lights. Since the analysis to follow will be focused on country-wide trends, the HPS data is reduced to only contain the national measure for anxiety or depressive disorder. Later in this section this data will be separated and presented based on different groupings such as race, sex, and state. The data collected from Google Trends has already been reduced to covering the United States and so little pre-processing is required.

The HPS data is presented in the following three visuals by race, sex, and state. To aggregate the values in the data based on the selected groupings the mean was used. This data is already heavily pre-processed, so using the mean is the most reasonable as the data can be assumed to be normally distributed.

In the first figure, the racial groups with the lowest survey responses indicating Anxiety or Depression, by almost 10%, are Non-Hispanic white and Non-Hispanic Asian. In an analysis based on the sex, female respondents indicated anxiety or depression symptoms approximately 10% more than male respondents. The second figure shows a heat map of the United States where more yellow coloring represents low percentages of responses indicating anxiety or depression and more blue colors represent higher percentages of responses. It appears the northern-middle part of the country has lowest percentages of responses indicating anxiety or depression, by almost 10%.

To present these several datasets on the same time line, the time frame was restricted to the 2020 calendar year. Since all the datasets except for the WHO COVID-19 cases and deaths are recorded in constant time periods that are greater than a day, they were considered to be constant during their respective sampling period. For example, the Google trends data on Anxiety is recorded every week. The value recorded every week is assigned to every day of that week. This allows for simple visual comparisons as shown below.

Strangely, the month of May has a day with a negative number of deaths, which isn’t possible. The maximum number of deaths in a day is exactly 5,000 which is highly unlikely. It is possible that there were mistakes made in the data collection process and these instances were corrections. Further investigation outside this report is warranted by this strange behavior, but for now the first error is assumed to be a correction of an over-count and the second error is assumed to be an estimate of that day’s number of deaths due to COVID. Both of these can be considered a part of the noise that naturally exists in the data.

Predictive/Inferential Analysis

In order to determine if there was any relationship between new COVID-19 cases or new COVID-19 deaths and values reported for symptoms of Anxiety Disorder or Depressive Disorder a linear model was fit. A linear model is appropriate for these analyses because the model is explaining a continuous variable, \(Y\), as a mathematical function of one or more variables, \(X\), so that we can use this model to predict the \(Y\) when only the \(X\) is known. The linear model used in this analysis is defined as \(Y = \beta_1 + \beta_2X + \epsilon\) where: \(Y\) represents the value reported for symptoms of Anxiety Disorder or Depressive Disorder
\(\beta_1\) represents the intercept \(\beta_2\) represents the slope \(\epsilon\) represents the error term (the part of \(Y\) that the regression model is unable to explain)

The \(H_0\) is that \(B_2=0\) and the \(H_A\) is that \(B_2\ne0\) meaning that there is a linear relationship between \(X\) and \(Y\).

In order to create the linear model first the dates across both datasets needed to be standardized. The CDC Depression and Anxiety data has 24 datapoints for distinct time periods. The WHO COVID-19 data has 427 datapoints for distinct days. The days in the WHO COVID-19 that match the time periods in the CDC data were then compiled and averages were taken for new cases and new deaths. Doing this controls for the time series variable in the data.

First plots were examined showing new COVID-19 cases or new COVID-19 deaths with values reported for symptoms of Anxiety Disorder or Depressive Disorder, this was done in order to visually assess if a linear model would be an appropriate fit to this data. After the plots were examined it was concluded that a linear model would provide, at least initially, a suitable fit to the data presented.

The linear models built to examine new COVID-19 cases or new COVID-19 deaths and values reported for symptoms of Anxiety Disorder or Depressive Disorder assume that the relationship between \(X\) and \(Y\) is linear and that for any fixed value of \(X\), \(Y\) is normally distributed. This model also assumes homoscedasticity, that the variance of residuals is the same for any value of \(X\), and independence, that each observation is independent of each other. In order to put the values for \(X\) and \(Y\) on a similar scale the log of \(Y\) was used in the model.

## 
## Call:
## lm(formula = log(cases) ~ anxiety_value, data = compiled_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52978 -0.26460 -0.00436  0.22306  0.67797 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.65076    0.90521   0.719     0.48    
## anxiety_value  0.26883    0.02369  11.349 1.15e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3119 on 22 degrees of freedom
## Multiple R-squared:  0.8541, Adjusted R-squared:  0.8475 
## F-statistic: 128.8 on 1 and 22 DF,  p-value: 1.151e-10
## 
## Call:
## lm(formula = log(deaths) ~ anxiety_value, data = compiled_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9003 -0.4282 -0.1433  0.4638  0.9418 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    4.68769    1.55852   3.008  0.00648 **
## anxiety_value  0.06283    0.04078   1.541  0.13769   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.537 on 22 degrees of freedom
## Multiple R-squared:  0.09737,    Adjusted R-squared:  0.05634 
## F-statistic: 2.373 on 1 and 22 DF,  p-value: 0.1377

The reported summary for the model relating \(Y\), number of new COVID-19 cases, to \(X\) the value reported in the CDC Depression and Anxiety data, shows a positive linear relationship that is significant at p value = 0.05, from this information we can reject the null hypothesis, \(H_0\) which states that \(B_2=0\).

The model relating \(Y\), number of new COVID-19 deaths, to \(X\) the value reported in the CDC Depression and Anxiety data, shows a positive linear relationship that is not significant at p value = 0.05, from this information we can accept the null hypothesis, \(H_0\), which states that \(B_2=0\). The rest of the analysis will focus on the relationship between new COVID-19 cases and the value reported in the CDC Depression and Anxiety data because that relationship was significant.

The conclusion that can be drawn from this model is that an association exists between the number of confirmed COVID-19 cases and the frequency of indicators, or symptoms, of Depressive or Anxiety disorder, this answers our primary question of interest for this report. This conclusion, derived from the data from the WHO Coronavirus Dashboard and the United States Census Bureau Household Pulse Survey (HPS) confirms the information presented by Holmes and O’Connor (2020) that there are effects from the COVID-19 pandemic on mental health. This information is also consistent with the Google Trends data that showed a significant increase in search terms related to poor mental well-being as associated with the global pandemic (Brodeur et al., 2020, p.1). The information here allows us to address our secondary question of interest, if an association exists between the number of COVID-19 cases and the frequency of indicators of depressive and anxiety disorder, whether a specific population in the United States experiences the greatest frequency of symptoms. The populations examined here were race groups (Hispanic or Latino, Non-Hispanic Asian, Non-Hispanic black, Non-Hispanic white, and Non-Hispanic other races and multiple races) and age groups (18-29, 30-39, 40-49, 50-59, 60-69, 70-79, and 80 years and above). Examining levels of anxiety and depression with an ANOVA between race groups and age groups will show if the pandemic has affected the anxiety and depression levels of these race groups and age groups uniformly or disparately.

In order to determine if there were any differences in values reported for symptoms of Anxiety Disorder or Depressive Disorder across race group and across age group an ANOVA model was fit (one model for each). An ANOVA model is appropriate for this analysis because there is one categorical explanatory variable. The two-way ANOVA model is defined as \(Y_{ij} = \mu + \alpha_{i} + \epsilon_{ij}\) where: \(Y_{ij}\) represents the expected value for reported symptoms of Anxiety Disorder or Depressive Disorder \(i\) represents the race group or age group \(j\) represents the \(j\)th observation for group \(i\) \(\mu{..}\) represents the expected mean over all race groups or age groups $ _{i}$ represents the mean adjustment for group \(i\) of race group or age group
\(\epsilon_{ij}\) represents the error for race group or age group \(i\), observation \(j\)

This model assumes that the mean of the response variable is influenced additively by the experimental factor, the errors of the model are independent, the errors have the same variance, and the errors are normally distributed. For this model that means that the value for reported symptoms of Anxiety Disorder or Depressive Disorder are influenced by race group or age group. For each observation, \(n\), in this model there is unexplained variation in the reported symptoms of Anxiety Disorder or Depressive Disorder from the calculated reported symptoms of Anxiety Disorder or Depressive Disorder. These deviations are assumed to be independent of one another and normally distributed around a mean of 0.

The null hypothesis for the race group model states that, \(H_0 : \alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = \alpha_5 = 0\), the alternative hypothesis states that \(H_a\) : not all \(\alpha\)s are zero.

##   Indicator            Group              State             Subgroup        
##  Length:120         Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     Phase            Time.Period    Time.Period.Label  Time.Period.Start.Date
##  Length:120         Min.   : 1.00   Length:120         Length:120            
##  Class :character   1st Qu.: 6.75   Class :character   Class :character      
##  Mode  :character   Median :12.50   Mode  :character   Mode  :character      
##                     Mean   :12.50                                            
##                     3rd Qu.:18.25                                            
##                     Max.   :24.00                                            
##  Time.Period.End.Date     Value           Low.CI         High.CI     
##  Length:120           Min.   :27.40   Min.   :24.40   Min.   :29.50  
##  Class :character     1st Qu.:35.60   1st Qu.:33.30   1st Qu.:37.08  
##  Mode  :character     Median :40.25   Median :38.10   Median :42.80  
##                       Mean   :39.91   Mean   :37.29   Mean   :42.56  
##                       3rd Qu.:44.02   3rd Qu.:41.00   3rd Qu.:47.40  
##                       Max.   :52.60   Max.   :48.40   Max.   :56.80  
##  Confidence.Interval Quartile.Range    
##  Length:120          Length:120        
##  Class :character    Class :character  
##  Mode  :character    Mode  :character  
##                                        
##                                        
## 
##              Df Sum Sq Mean Sq F value Pr(>F)    
## Subgroup      4   3130   782.4   81.06 <2e-16 ***
## Residuals   115   1110     9.7                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## $emmeans
##  Subgroup                                     emmean    SE  df lower.CL
##  Hispanic or Latino                             43.3 0.634 115     42.0
##  Non-Hispanic Asian, single race                32.3 0.634 115     31.1
##  Non-Hispanic black, single race                40.9 0.634 115     39.6
##  Non-Hispanic white, single race                36.3 0.634 115     35.1
##  Non-Hispanic, other races and multiple races   46.8 0.634 115     45.6
##  upper.CL
##      44.5
##      33.6
##      42.1
##      37.6
##      48.1
## 
## Confidence level used: 0.95 
## 
## $contrasts
##  contrast                                                                          
##  Hispanic or Latino - (Non-Hispanic Asian, single race)                            
##  Hispanic or Latino - (Non-Hispanic black, single race)                            
##  Hispanic or Latino - (Non-Hispanic white, single race)                            
##  Hispanic or Latino - (Non-Hispanic, other races and multiple races)               
##  (Non-Hispanic Asian, single race) - (Non-Hispanic black, single race)             
##  (Non-Hispanic Asian, single race) - (Non-Hispanic white, single race)             
##  (Non-Hispanic Asian, single race) - (Non-Hispanic, other races and multiple races)
##  (Non-Hispanic black, single race) - (Non-Hispanic white, single race)             
##  (Non-Hispanic black, single race) - (Non-Hispanic, other races and multiple races)
##  (Non-Hispanic white, single race) - (Non-Hispanic, other races and multiple races)
##  estimate    SE  df t.ratio p.value
##     10.95 0.897 115  12.215 <.0001 
##      2.42 0.897 115   2.699 0.0602 
##      6.95 0.897 115   7.754 <.0001 
##     -3.54 0.897 115  -3.949 0.0013 
##     -8.53 0.897 115  -9.515 <.0001 
##     -4.00 0.897 115  -4.460 0.0002 
##    -14.50 0.897 115 -16.164 <.0001 
##      4.53 0.897 115   5.055 <.0001 
##     -5.96 0.897 115  -6.649 <.0001 
##    -10.50 0.897 115 -11.703 <.0001 
## 
## P value adjustment: tukey method for comparing a family of 5 estimates

The results of the ANOVA model comparing reported symptoms of Anxiety Disorder or Depressive Disorder between race groups indicates that there are significant differences between race groups, the null hypothesis, \(H_0\), is rejected. Non-Hispanic other races and multiple races had the highest values reported for symptoms of Anxiety Disorder or Depressive Disorder, followed by Hispanic or Latino which was followed by, by not significantly different than, Non-Hispanic black. The next highest rates were in Non-Hispanic white, then Non-Hispanic Asian. All differences between race groups were significant aside from the one mentioned. In order to better interpret this outcome more information is needed on the racial make up behind the Non-Hispanic other races and multiple races group. Overall the results seen here are consistent with the current literature on this topic as well as with our initial hypothesis which stated that non white groups will have the greatest frequency of symptoms (Bui et al., 2020, p. 262).

The null hypothesis for the age group model states that, \(H_0 : \alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = \alpha_5 = \alpha_6 = \alpha_7 = 0\), the alternative hypothesis states that \(H_a\) : not all \(\alpha\)s are zero.

##   Indicator            Group              State             Subgroup        
##  Length:168         Length:168         Length:168         Length:168        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     Phase            Time.Period    Time.Period.Label  Time.Period.Start.Date
##  Length:168         Min.   : 1.00   Length:168         Length:168            
##  Class :character   1st Qu.: 6.75   Class :character   Class :character      
##  Mode  :character   Median :12.50   Mode  :character   Mode  :character      
##                     Mean   :12.50                                            
##                     3rd Qu.:18.25                                            
##                     Max.   :24.00                                            
##  Time.Period.End.Date     Value           Low.CI         High.CI     
##  Length:168           Min.   :13.90   Min.   :10.80   Min.   :17.30  
##  Class :character     1st Qu.:26.38   1st Qu.:23.05   1st Qu.:28.80  
##  Mode  :character     Median :35.80   Median :34.05   Median :37.95  
##                       Mean   :35.15   Mean   :32.91   Mean   :37.50  
##                       3rd Qu.:43.00   3rd Qu.:41.17   3rd Qu.:44.62  
##                       Max.   :58.70   Max.   :55.80   Max.   :61.50  
##  Confidence.Interval Quartile.Range    
##  Length:168          Length:168        
##  Class :character    Class :character  
##  Mode  :character    Mode  :character  
##                                        
##                                        
## 
##              Df Sum Sq Mean Sq F value Pr(>F)    
## Subgroup      6  18331  3055.2   307.9 <2e-16 ***
## Residuals   161   1597     9.9                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## $emmeans
##  Subgroup           emmean    SE  df lower.CL upper.CL
##  18 - 29 years        51.7 0.643 161     50.4     53.0
##  30 - 39 years        43.7 0.643 161     42.4     44.9
##  40 - 49 years        40.1 0.643 161     38.8     41.4
##  50 - 59 years        37.0 0.643 161     35.8     38.3
##  60 - 69 years        30.1 0.643 161     28.8     31.4
##  70 - 79 years        23.2 0.643 161     22.0     24.5
##  80 years and above   20.3 0.643 161     19.0     21.5
## 
## Confidence level used: 0.95 
## 
## $contrasts
##  contrast                             estimate    SE  df t.ratio p.value
##  (18 - 29 years) - (30 - 39 years)        8.02 0.909 161  8.817  <.0001 
##  (18 - 29 years) - (40 - 49 years)       11.59 0.909 161 12.744  <.0001 
##  (18 - 29 years) - (50 - 59 years)       14.65 0.909 161 16.116  <.0001 
##  (18 - 29 years) - (60 - 69 years)       21.58 0.909 161 23.737  <.0001 
##  (18 - 29 years) - (70 - 79 years)       28.47 0.909 161 31.307  <.0001 
##  (18 - 29 years) - 80 years and above    31.43 0.909 161 34.570  <.0001 
##  (30 - 39 years) - (40 - 49 years)        3.57 0.909 161  3.927  0.0024 
##  (30 - 39 years) - (50 - 59 years)        6.64 0.909 161  7.300  <.0001 
##  (30 - 39 years) - (60 - 69 years)       13.57 0.909 161 14.920  <.0001 
##  (30 - 39 years) - (70 - 79 years)       20.45 0.909 161 22.491  <.0001 
##  (30 - 39 years) - 80 years and above    23.42 0.909 161 25.753  <.0001 
##  (40 - 49 years) - (50 - 59 years)        3.07 0.909 161  3.373  0.0159 
##  (40 - 49 years) - (60 - 69 years)       10.00 0.909 161 10.993  <.0001 
##  (40 - 49 years) - (70 - 79 years)       16.88 0.909 161 18.563  <.0001 
##  (40 - 49 years) - 80 years and above    19.85 0.909 161 21.826  <.0001 
##  (50 - 59 years) - (60 - 69 years)        6.93 0.909 161  7.621  <.0001 
##  (50 - 59 years) - (70 - 79 years)       13.81 0.909 161 15.191  <.0001 
##  (50 - 59 years) - 80 years and above    16.78 0.909 161 18.453  <.0001 
##  (60 - 69 years) - (70 - 79 years)        6.88 0.909 161  7.570  <.0001 
##  (60 - 69 years) - 80 years and above     9.85 0.909 161 10.833  <.0001 
##  (70 - 79 years) - 80 years and above     2.97 0.909 161  3.263  0.0224 
## 
## P value adjustment: tukey method for comparing a family of 7 estimates

The results of the ANOVA model comparing reported symptoms of Anxiety Disorder or Depressive Disorder between age groups indicates that there are significant difference between all age groups, the null hypothesis, \(H_0\), is rejected. Age group 18-29 years had the highest reported values for symptoms of Anxiety Disorder or Depressive Disorder followed by 30-39 years which was followed by 40-49 years. The four age groups following (50-59 years, 60-69 years, 70-79 years, and 80 years and above) had reported symptoms of Anxiety Disorder or Depressive Disorder that decreased according to increasing age. The analysis is consistent with research that shows that younger adults are more vulnerable to symptoms of anxiety and depression than older adults (Abbott, 2020, p. 194; Delmastro & Zamariola, 2020, p. 8). This result is also consistent with our initial hypothesis which stated that young age groups will have the greatest frequency of symptoms, however, this result is not consistent with our initial hypothesis in that older adults may also be more vulnerable to symptoms of anxiety and depression.

Sensitivity Analysis

For our linear model we want to check the assumption of the linear regression model including: * Linear Relationship * Homoscedasticity * Independence * Normality

The raw case data in relation to anxiety value does not appear to be linear. The lm() line calculation does not fit all of the data evenly and has some tailing at the high end of cases and anxiety values. While there does appear to be a correlation, the relationship does not appear to be linear.

Upon an initial visualization to check for a linear relationship, we see that there is nothing to indicate otherwise. The data does not appear to be in any other form than linear. Log-transorming the data allows us to perform a linear regression as we cannot reject the assumption that the relationship is linear after tranforming the data.

In the residual plot, we want to check that the fitted residual line is mostly horizontal. We want our residual plot to have randomly distributed error terms with no clear pattern that are relatively close to this fitted line. In the model that relates log-transformed case counts and anxiety rates in the survey, we see this line is not perfectly horizontal, but nothing concerning to throw out the entire model. Some of the data is not spread evenly and there may be a couple outliers.

In the Normal Q-Q plot, we want to see that most of the data points are on the theoretical dashed line incicating normality. Our data here has a majority of the points near this line and has some tailing on either end of the distribution but nothing major. This would indicate normal distribution.

The Scale-Location plot, similar to the residual plot, wants a fitted line as horizontal as possible. Here we see similar results with no major deviations from a horizontal fitted line and a similar spread which should be looked into further via Cook’s distance to determine outliers.

In the Residuals vs leverage plot with Cook’s distance, we see that there are no points that pass the Cook’s distance line. Since no points are outside the line our model assumptions hold.

For the ANOVA models used in the report, we want to check for the following indicators that the assumptions do not hold.

To check for any of these, we will look at the same plots as we did for the linear model.

For the race ANOVA model, we see that the residuals are evenly and seemingly randomly distributed on either side of the regression line. Nothing concerning here.

The normal Q-Q plot does contain some tailing at either end of the sample which is mildly concerning. However, this tailing is not significant enough to conclude that our model assumption of normally distributed error terms do not hold.

Similar to the first residual plot with the standardized residuals, applying the square root function to the residuals does not show anything concerning.

While the Cook’s distance plot, upon first glance, shows a couple potential influential points, a quick look at the Y-Axis shows that none of these values are disturbingly high.

Overall, the ANOVA race model’s assumption appear to hold up well. The error terms appear to be independent and normally distributed. No major outliers appear to exist and the variance of error terms look equal.

Next, we will look at the plots for the ANOVA age model to help determine if our model assumptions hold.

Our residual vs fitted plot shows again a nearly flat line which indicates that the error terms are independent and randomly distributed.

The Normal Q-Q plot shows some tailing on either side but nothing alarming. The assumption that the error terms are normally distributed appears to hold.

After standardizing the residuals, the plot’s line looks less flat than the unstandardized residuals but overall nothing concerning.

The Cook’s distance plot also shows, with a maximum of approimately 0.04, that no points are overly influential.

To summarize the sensitivity analysis suggestion, our assumptions for the linear model

Insights and Suggestions for Future Analysis

All our assumptions appear to hold for the model that related log-transformed case counts and anxiety rates in the country. While some data points come close to the Cook’s distance line of 0.5 and our residual plot is not as great as it could be, there is not strong enough evidence in these plots to reject the model and it’s assumptions.

One way to be more conclusive about the assumptions is to have more data points, the data we collected is only weekly and the case-counts were aggregated weekly to allow for a proper comparison. More data points may be helpful, but weekly case data tends to be similar to daily case data and may not end up having that much of an impact on the conclusions. However, more data points would allow the model validation to be improved. In addition a further analysis could be completed to see how values for Anxiety Disorder and Depression Disorder effect different populations such as looking at differences accross age groups or ethnicities.

Conclusion

Our analysis shows a relationship between the number of COVID-19 cases and indicators of depression and anxiety among people living in the United States. We aggregated the daily case counts from the WHO Coronavirus Dashboard with the frequency of anxiety and depression symptoms reported in the HPS data set, and then used this aggregated data to create a linear model to analyze the relationship between case rates and the rate of anxiety or depression symptoms. We found a relationship between these two variables with an extremely low p-value and a high adjusted \(R^2\) of 0.8475. The model assumptions appear to hold, so the low p-value suggests that we may conclusively reject the null hypothesis of no relationship existing between COVID-19 cases and indicators of anxiety or depression. In addition, the high Adjusted \(R^2\) suggests that COVID-19 cases and symptoms of depression or anxiety are highly correlated.

Caveats of our analysis include not being able to infer causality between COVID-19 cases and symptoms of depression or anxiety and not measuring the effects of potential confounders. Barriers to making causal statements arise from the lack of information produced by the HPS on respondents’ employment status, social support, family and living arrangements, pre-existing mental health symptoms and diagnoses, and community characteristics, all of which may be related to emotional distress (Bui et al., 2020; Ganson et al., 2021). Potential confounders that may be important to address in future analysis are the effects of season or pre-existing mental health conditions. For example, weather fluctuations might influence both COVID-19 transmission and the prevalence of seasonal affective disorder, so future analysis that incorporates seasonal effects such as temperature and hours of daylight might produce more precise results. Additionally, chronic stress associated with pre-existing mental health conditions weakens individuals’ immune systems and might lead to more frequent symptoms of depression or anxiety; therefore, future analysis incorporating effects of pre-existing mental health conditions such as biological or psychological markers of stress might also produce more precise results.

Depression and anxiety have been studied extensively and many factors are known to contribute to their development. Some of these factors, such as staying indoors extensively and loss of employment, are byproducts of the government restrictions and lockdowns due to COVID-19. Other factors such as the deaths of loved ones are unfortunately inevitable consequences of the COVID-19 pandemic. These external factors may further obfuscate the true cause of the increased frequency of depression and anxiety symptoms since the start of the pandemic, so it is important to recognize that we cannot conclude that COVID-19 cases cause increased depression or anxiety rates. However, we can acknowledge that the COVID-19 pandemic has introduced multifaceted challenges to healthcare and policy, and that mental health should not be ignored in efforts to understand and solve them.

References

Abbott, A. (2021). COVID’s mental-health toll: how scientists are tracking a surge in depression. Nature, 590(7845), 194–195. https://doi.org/10.1038/d41586-021-00175-z

Brodeur, A., Clark, A. E., Fleche, S., & Powdthavee, N. (2021). COVID-19, lockdowns and well-being: Evidence from Google Trends. Journal of Public Economics, 193, 104346. https://doi.org/10.1016/j.jpubeco.2020.104346

Bui, C. N., Peng, C., Mutchler, J. E., & Burr, J. A. (2020). Race and Ethnic Group Disparities in Emotional Distress Among Older Adults During the COVID-19 Pandemic. The Gerontologist, 61(2), 262–272. https://doi.org/10.1093/geront/gnaa217

Clemente-Suárez, V. J., Dalamitros, A. A., Beltran-Velasco, A. I., Mielgo-Ayuso, J., & Tornero-Aguilera, J. F. (2020). Social and Psychophysiological Consequences of the COVID-19 Pandemic: An Extensive Literature Review. Frontiers in Psychology, 11, 1–15. https://doi.org/10.3389/fpsyg.2020.580225

Delmastro, M., & Zamariola, G. (2020). Depressive symptoms in response to COVID-19 and lockdown: a cross-sectional study on the Italian population. Scientific Reports, 10(1), 1–10. https://doi.org/10.1038/s41598-020-79850-6

Fields JF, Hunter-Childs J, Tersine A, Sisson J, Parker E, Velkoff V, Logan C, and Shin H. Design and Operation of the 2020 Household Pulse Survey, 2020. U.S. Census Bureau. Forthcoming. Ganson, K. T., Tsai, A. C., Weiser, S. D., Benabou, S. E., & Nagata, J. M. (2021). Job Insecurity and Symptoms of Anxiety and Depression Among U.S. Young Adults During COVID-19. Journal of Adolescent Health, 68(1), 53–56. https://doi.org/10.1016/j.jadohealth.2020.10.008

Holmes, E. A., O’Connor, R. C., Perry, V. H., Tracey, I., Wessely, S., Arseneault, L., Ballard, C., Christensen, H., Cohen Silver, R., Everall, I., Ford, T., John, A., Kabir, T., King, K., Madan, I., Michie, S., Przybylski, A. K., Shafran, R., Sweeney, A., … Bullmore, E. (2020). Multidisciplinary research priorities for the COVID-19 pandemic: a call for action for mental health science. The Lancet Psychiatry, 7(6), 547–560. https://doi.org/10.1016/s2215-0366(20)30168-1

WHO COVID-19 Dashboard. Geneva: World Health Organization, 2020. Available online: https://covid19.who.int/

Appendix

# Setup this markdown file:
knitr::opts_chunk$set(
  fig.pos = 'H',
  warning = FALSE,
  message = FALSE,
  eval = TRUE, 
  echo = FALSE, 
  fig.align='center'
)
# Include all necessary libraries
library(tidyverse)
library(usmap)
library(reshape2)
library(lubridate)
library(rstatix)
library(gt)
# Import WHO COVID Data, CDC Depression and Anxiety Data, and Google Trends
WHO.path <- file.path("..","Data","WHO-COVID-19-global-data.csv")
CDC.path <- file.path("..","Data","CDC_Anxiety.csv")
GOOGLE.path <- file.path("..","Data","Google_US_Depression_Anxiety.csv")
covid <- read.csv(file = WHO.path, fileEncoding="UTF-8-BOM")
indicators <- read.csv(file = CDC.path, header=T)
GOOGLE <- read.csv(file = GOOGLE.path)

# Convert date column to POSIXct
GOOGLE$Week <- as.POSIXct(GOOGLE$Week)
GOOGLE <- subset(GOOGLE, Week >= as.POSIXct("2020-01-01"))

# subset covid data to include only US data 
covid_US = subset(covid, Country == "United States of America") %>% select(Date_reported,New_cases,Cumulative_cases,New_deaths,Cumulative_deaths)
covid_US$Date_reported <- as.POSIXct(covid_US$Date_reported, format = "%Y-%m-%d")

# subset CDC anxiety data to include only the measures for anxiety or depressive disorder (combined) and the national measure 
indicators_US = subset(indicators, Indicator == "Symptoms of Anxiety Disorder or Depressive Disorder" & 
                         Group == "National Estimate" ) %>%     select(Time.Period,Time.Period.Label,Time.Period.Start.Date,Time.Period.End.Date,Value,Low.CI,High.CI,Quartile.Range)

# remove na data from the indicators leaving 24 values for 24 distinct time periods  
indicators_US.narm <- indicators_US %>% filter(!is.na(Value))

# Convert start/end date to standard format.
indicators_US.narm$Time.Period.Start.Date <- as.POSIXct(indicators_US.narm$Time.Period.Start.Date, format = "%m/%d/%Y")
indicators_US.narm$Time.Period.End.Date <- as.POSIXct(indicators_US.narm$Time.Period.End.Date, format = "%m/%d/%Y")
# Present Depression or Anxiety Information by Race
indicators.race <- subset(indicators, Indicator == "Symptoms of Anxiety Disorder or Depressive Disorder" & 
                         Group == "By Race/Hispanic ethnicity", c(Subgroup, Value)) %>% filter(!is.na(Value))
indicators.race$Subgroup <- as.character(indicators.race$Subgroup)
indicators.race.means <- aggregate(indicators.race, list(indicators.race$Subgroup), mean)
ggplot(indicators.race.means, aes(x=Value, y=Group.1, )) + 
  geom_bar(aes(fill=Group.1), stat = "identity", position = "dodge") +
  theme(legend.position = "none") +
  labs(title = "Depression or Anxiety Symptoms by Race",
       subtitle = "Average 2020 Household Pulse Survey",
       x = "Average % of Survey Responses Indicating Anxiety or Depression", 
       y = "")

# Present Depression or Anxiety Information by Sex
indicators.sex <- subset(indicators, Indicator == "Symptoms of Anxiety Disorder or Depressive Disorder" & 
                         Group == "By Gender",c(Subgroup, Value))  %>% filter(!is.na(Value))
indicators.sex.means <- aggregate(indicators.sex, list(indicators.sex$Subgroup), mean)
# ggplot(indicators.sex.means, aes(x=Value, y=Group.1)) + 
#  geom_bar(aes(fill=Group.1), stat = "identity", position = "dodge") +
#  theme(legend.position = "none") +
#  labs(title = "Depression or Anxiety Symptoms by Sex", 
#       subtitle = "Average 2020 Household Pulse Survey",
#       x = "Average % of Survey Responses Indicating Anxiety or Depression", 
#       y = "")

# Present Depression or Anxiety Information by State
indicators.state <- subset(indicators, Indicator == "Symptoms of Anxiety Disorder or Depressive Disorder" & 
                         Group == "By State", c(Subgroup,Value))  %>% filter(!is.na(Value))
indicators.state.means <- aggregate(indicators.state, list(indicators.state$Subgroup), mean) %>% select(Group.1,Value)
indicators.state.means$Group.1 <- usmap::fips(indicators.state.means$Group.1)
names(indicators.state.means) <- c('fips','Value')
usmap::plot_usmap(data=indicators.state.means, values="Value") +
  theme(legend.position = 'right') +
  labs(title = "Depression or Anxiety Symptoms by State", subtitle = "Average 2020 Household Pulse Survey") +
  scale_fill_continuous(low = "yellow", high = "blue", name = "Anxiety or Depression Percentage")
# Fill in dates of indicators and GOOGLE
HPS <- list()
for (i in 1:(dim(indicators_US.narm)[1]-1)) {
  fill.dates <- seq(indicators_US.narm[i,'Time.Period.Start.Date'], indicators_US.narm[i+1,'Time.Period.Start.Date'] - 1*60*60*24, by="days")
  fill.values <- c(rep(1,length(fill.dates)))*indicators_US.narm[i,'Value']
  fill.dates.df <- data.frame(fill.dates,fill.values)
  HPS[[i]] <- fill.dates.df
}
HPS.full <- bind_rows(HPS)

Trends.Anxiety <- list()
Trends.Depression <- list()
for (i in 1:(dim(GOOGLE)[1]-1)) {
  fill.dates <- seq(GOOGLE[i,'Week'], GOOGLE[i+1,'Week'] - 1*60*60*24, by="days")
  fill.anxiety.values <- c(rep(1,length(fill.dates)))*GOOGLE[i,'Anxiety...United.States.']
  fill.depression.values <- c(rep(1,length(fill.dates)))*GOOGLE[i,'Depression...United.States.']
  fill.dates.df <- data.frame(fill.dates,fill.anxiety.values)
  Trends.Anxiety[[i]] <- fill.dates.df
  fill.dates.df <- data.frame(fill.dates,fill.depression.values)
  Trends.Depression[[i]] <- fill.dates.df
}
Trends.Anxiety.Full <- bind_rows(Trends.Anxiety)
Trends.Depression.Full <- bind_rows(Trends.Depression)

# Combine into one dataframe
all.covid <- covid_US %>% subset(Date_reported >= as.POSIXct("2020-04-23") & Date_reported < as.POSIXct("2021-01-01"),c(Date_reported,New_cases,New_deaths))
all.Anxiety <- Trends.Anxiety.Full %>% subset(fill.dates >= as.POSIXct("2020-04-23") & fill.dates < as.POSIXct("2021-01-01"))
all.Depression <- Trends.Depression.Full %>% subset(fill.dates >= as.POSIXct("2020-04-23") & fill.dates < as.POSIXct("2021-01-01"))
all.HPS <- HPS.full %>% subset(fill.dates >= as.POSIXct("2020-04-23") & fill.dates < as.POSIXct("2021-01-01"))
all.df <- data.frame(all.covid,all.Anxiety$fill.anxiety.values,all.Depression$fill.depression.values,all.HPS$fill.values)
names(all.df) <- c("Date","New Cases","New Deaths","Anxiety Google Trends","Depression Google Trends","Household Pulse Survey")
# Show daily deaths, cases, anxiety|depression, and google trends overtime
all.df.m <- melt(all.df %>% select(1:3), "Date")
# New_cases,New_deaths,all.Anxiety.fill.anxiety.values,all.Depression.fill.depression.values,all.HPS.fill.values
#options(repr.plot.width = 8, repr.plot.height = 0.75)
ggplot(all.df.m, aes(Date,value,colour = variable)) + 
  geom_point() + 
  facet_wrap(~ variable, ncol=1,scales = "free_y") +
  labs(title = "COVID-19 Deaths and Cases", 
       subtitle = "WHO Data in 2020",
       y = "") + 
  theme(text = element_text(size = 12), element_line(size = 0.2))


percentages <- data.frame(all.covid$Date,all.Anxiety$fill.anxiety.values,all.Depression$fill.depression.values,all.HPS$fill.values)
names(percentages) <-c("Date","Anxiety Google Trends","Depression Google Trends","Household Pulse Survey")
percent.m <- melt(percentages,"Date")

ggplot(percent.m, aes(Date,value,colour = variable)) +
  geom_point() + 
  labs(title = "HPS Results and Google Trends", 
       subtitle = "HPS and Google Trends on 'Anxiety' and 'Depression' Topics in 2020",
       y = "% of responders indicating symptoms and Google's weighted %") + 
  theme(text = element_text(size = 12), element_line(size = 0.2))
covid.US.Months <- data.frame(covid_US) %>% subset(Date_reported >= as.POSIXct("2020-01-01") & Date_reported < as.POSIXct("2021-01-01"),c(Date_reported,New_cases,New_deaths))
covid.US.Months$Date_reported <- format(covid.US.Months$Date_reported, "01-%m-%Y")
covid.US.Months.stats.cases <- covid.US.Months %>% 
                                select(Date_reported,New_cases) %>%
                                group_by(Date_reported) %>%
                                get_summary_stats(type='common')
covid.US.Months.stats.deaths <- covid.US.Months %>% 
                                  select(Date_reported,New_deaths) %>%
                                  group_by(Date_reported) %>%
                                  get_summary_stats(type='common')
# covid.US.Months.stats.cases %>% 
#   select('Date_reported','min','max','median','mean','sd') %>%
#   gt() %>%
#   tab_header(
#     title = "2020 COVID-19 Cases",
#     subtitle = "Values per day of each month"
#   ) %>%
#   fmt_date(
#     columns = vars(Date_reported),
#     date_style = 11
#   ) %>%
#   fmt_number(
#     columns = vars(min,max,median,mean,sd),
#     decimals = 0
#   )
# 
# covid.US.Months.stats.deaths %>% 
#   select('Date_reported','min','max','median','mean','sd') %>%
#   gt() %>%
#   tab_header(
#     title = "2020 COVID-19 Deaths",
#     subtitle = "Values per day of each month"
#   ) %>%
#   fmt_date(
#     columns = vars(Date_reported),
#     date_style = 11
#   ) %>%
#   fmt_number(
#     columns = vars(min,max,median,mean,sd),
#     decimals = 0
#   )
dates1 = covid_US[112:124,]
dates2 = covid_US[126:131,]
dates3 = covid_US[133:138,]
dates4 = covid_US[140:145,]
dates5 = covid_US[147:152,]
dates6 = covid_US[154:159,]
dates7 = covid_US[161:166,]
dates8 = covid_US[168:173,]
dates9 = covid_US[175:180,]
dates10 = covid_US[182:187,]
dates11 = covid_US[189:194,]
dates12 = covid_US[196:201,]
dates13 = covid_US[230:242,]
dates14 = covid_US[244:256,]
dates15 = covid_US[258:270,]
dates16 = covid_US[272:284,]
dates17 = covid_US[286:298,]
dates18 = covid_US[300:312,]
dates19 = covid_US[314:326,]
dates20 = covid_US[328:340,]
dates21 = covid_US[342:354,]
dates22 = covid_US[370:382,]
dates23 = covid_US[384:396,]
dates24 = covid_US[398:410,]

dates1_cases = mean(dates1$New_cases)
dates1_deaths = mean(dates1$New_deaths)

dates2_cases = mean(dates2$New_cases)
dates2_deaths = mean(dates2$New_deaths)

dates3_cases = mean(dates3$New_cases)
dates3_deaths = mean(dates3$New_deaths)

dates4_cases = mean(dates4$New_cases)
dates4_deaths = mean(dates4$New_deaths)

dates5_cases = mean(dates5$New_cases)
dates5_deaths = mean(dates5$New_deaths)

dates6_cases = mean(dates6$New_cases)
dates6_deaths = mean(dates6$New_deaths)

dates7_cases = mean(dates7$New_cases)
dates7_deaths = mean(dates7$New_deaths)

dates8_cases = mean(dates8$New_cases)
dates8_deaths = mean(dates8$New_deaths)

dates9_cases = mean(dates9$New_cases)
dates9_deaths = mean(dates9$New_deaths)

dates10_cases = mean(dates10$New_cases)
dates10_deaths = mean(dates10$New_deaths)

dates11_cases = mean(dates11$New_cases)
dates11_deaths = mean(dates11$New_deaths)

dates12_cases = mean(dates12$New_cases)
dates12_deaths = mean(dates12$New_deaths)

dates13_cases = mean(dates13$New_cases)
dates13_deaths = mean(dates13$New_deaths)

dates14_cases = mean(dates14$New_cases)
dates14_deaths = mean(dates14$New_deaths)

dates15_cases = mean(dates15$New_cases)
dates15_deaths = mean(dates15$New_deaths)

dates16_cases = mean(dates16$New_cases)
dates16_deaths = mean(dates16$New_deaths)

dates17_cases = mean(dates17$New_cases)
dates17_deaths = mean(dates17$New_deaths)

dates18_cases = mean(dates18$New_cases)
dates18_deaths = mean(dates18$New_deaths)

dates19_cases = mean(dates19$New_cases)
dates19_deaths = mean(dates19$New_deaths)

dates20_cases = mean(dates20$New_cases)
dates20_deaths = mean(dates20$New_deaths)

dates21_cases = mean(dates21$New_cases)
dates21_deaths = mean(dates21$New_deaths)

dates22_cases = mean(dates22$New_cases)
dates22_deaths = mean(dates22$New_deaths)

dates23_cases = mean(dates23$New_cases)
dates23_deaths = mean(dates23$New_deaths)

dates24_cases = mean(dates24$New_cases)
dates24_deaths = mean(dates24$New_deaths)

dates = c("dates1", "dates2", "dates3", "dates4", "dates5", "dates 6", "dates7", "dates8", "dates9", "dates10", "dates11", "dates12", "dates13", "dates14", "dates15", "dates16", "dates17", "dates18", "dates19", "dates20", "dates21", "dates22", "dates23", "dates24")

cases = c(dates1_cases, dates2_cases, dates3_cases, dates4_cases, dates5_cases, dates6_cases, dates7_cases, dates8_cases, dates9_cases, dates10_cases, dates11_cases, dates12_cases, dates13_cases, dates14_cases, dates15_cases, dates16_cases, dates17_cases, dates18_cases, dates19_cases, dates20_cases, dates21_cases, dates22_cases, dates23_cases, dates24_cases)

deaths = c(dates1_deaths, dates2_deaths, dates3_deaths, dates4_deaths, dates5_deaths, dates6_deaths, dates7_deaths, dates8_deaths, dates9_deaths, dates10_deaths, dates11_deaths, dates12_deaths, dates13_deaths, dates14_deaths, dates15_deaths, dates16_deaths, dates17_deaths, dates18_deaths, dates19_deaths, dates20_deaths, dates21_deaths, dates22_deaths, dates23_deaths, dates24_deaths)

anxiety_value = indicators_US.narm$Value
 
compiled_data = data.frame(dates, cases, deaths, anxiety_value)
compiled_data
#Plot showing new cases vs anxiety value 
plot(x=cases, y=anxiety_value, main="cases ~ anxiety_value")  

#Plot showing new deaths vs anxiety value 
plot(x=deaths, y=anxiety_value, main="deaths ~ anxiety_value")
linearMod_cases = lm(log(cases) ~ anxiety_value, data=compiled_data)
summary(linearMod_cases)
linearMod_deaths <- lm(log(deaths) ~ anxiety_value, data=compiled_data)
summary(linearMod_deaths)
covid_US

#subsetting the CDC anxiety data to include only the measures for anxiety or depressive disorder (combined) grouped by race 
indicators
indicators_race = subset(indicators, Indicator == "Symptoms of Anxiety Disorder or Depressive Disorder" & 
                         Group == "By Race/Hispanic ethnicity" )
indicators_race
indicators_race.narm <- indicators_race %>% filter(!is.na(Value))
indicators_race.narm
summary(indicators_race.narm)
anova_race = aov(Value ~ Subgroup, data = indicators_race.narm)
summary(anova_race)

library(emmeans)
emmeans(anova_race, pairwise ~ "Subgroup")
covid_US

#subsetting the CDC anxiety data to include only the measures for anxiety or depressive disorder (combined) grouped by age 
indicators
indicators_age = subset(indicators, Indicator == "Symptoms of Anxiety Disorder or Depressive Disorder" & 
                         Group == "By Age" )
indicators_age
indicators_age.narm <- indicators_age %>% filter(!is.na(Value))
indicators_age.narm
summary(indicators_age.narm)
anova_age = aov(Value ~ Subgroup, data = indicators_age.narm)
summary(anova_age)

emmeans(anova_age, pairwise ~ "Subgroup")
  ggplot(data = compiled_data, aes(x = anxiety_value, y=cases)) +
    geom_point(color = 'blue') +
    geom_smooth(method = "lm", se = FALSE, color = 'red')
  ggplot(data = compiled_data, aes(x = anxiety_value, y=log(cases))) +
    geom_point(color = 'blue') +
    geom_smooth(method = "lm", se = FALSE, color = 'red')
  plot(linearMod_cases, which = 1)
  plot(linearMod_cases, which = 2)
  plot(linearMod_cases, which = 3)
  plot(linearMod_cases, which = 5)
  plot(anova_race, which = 1)
  plot(anova_race, which = 2)
  plot(anova_race, which = 3)
  plot(anova_race, which = 4)
  plot(anova_age, which = 1)
  plot(anova_age, which = 2)
  plot(anova_age, which = 3)
  plot(anova_age, which = 4)